{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
    "\n",
    "<i>Licensed under the MIT License.</i>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LSTUR: Neural News Recommendation with Long- and Short-term User Representations\n",
    "LSTUR \\[1\\] is a news recommendation approach capturing users' both long-term preferences and short-term interests. The core of LSTUR is a news encoder and a user encoder.  In the news encoder, we learn representations of news from their titles. In user encoder, we propose to learn long-term\n",
    "user representations from the embeddings of their IDs. In addition, we propose to learn short-term user representations from their recently browsed news via GRU network. Besides, we propose two methods to combine\n",
    "long-term and short-term user representations. The first one is using the long-term user representation to initialize the hidden state of the GRU network in short-term user representation. The second one is concatenating both\n",
    "long- and short-term user representations as a unified user vector.\n",
    "\n",
    "## Properties of LSTUR:\n",
    "- LSTUR captures users' both long-term and short term preference.\n",
    "- It uses embeddings of users' IDs to learn long-term user representations.\n",
    "- It uses users' recently browsed news via GRU network to learn short-term user representations.\n",
    "\n",
    "## Data format:\n",
    "For quicker training and evaluaiton, we sample MINDdemo dataset of 5k users from [MIND small dataset](https://msnews.github.io/). The MINDdemo dataset has the same file format as MINDsmall and MINDlarge. If you want to try experiments on MINDsmall and MINDlarge, please change the dowload source. Select the MIND_type parameter from ['large', 'small', 'demo'] to choose dataset.\n",
    " \n",
    "**MINDdemo_train** is used for training, and **MINDdemo_dev** is used for evaluation. Training data and evaluation data are composed of a news file and a behaviors file. You can find more detailed data description in [MIND repo](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md)\n",
    "\n",
    "### news data\n",
    "This file contains news information including newsid, category, subcatgory, news title, news abstarct, news url and entities in news title, entities in news abstarct.\n",
    "One simple example: <br>\n",
    "\n",
    "`N46466\tlifestyle\tlifestyleroyals\tThe Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By\tShop the notebooks, jackets, and more that the royals can't live without.\thttps://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata\t[{\"Label\": \"Prince Philip, Duke of Edinburgh\", \"Type\": \"P\", \"WikidataId\": \"Q80976\", \"Confidence\": 1.0, \"OccurrenceOffsets\": [48], \"SurfaceForms\": [\"Prince Philip\"]}, {\"Label\": \"Charles, Prince of Wales\", \"Type\": \"P\", \"WikidataId\": \"Q43274\", \"Confidence\": 1.0, \"OccurrenceOffsets\": [28], \"SurfaceForms\": [\"Prince Charles\"]}, {\"Label\": \"Elizabeth II\", \"Type\": \"P\", \"WikidataId\": \"Q9682\", \"Confidence\": 0.97, \"OccurrenceOffsets\": [11], \"SurfaceForms\": [\"Queen Elizabeth\"]}]\t[]`\n",
    "<br>\n",
    "\n",
    "In general, each line in data file represents information of one piece of news: <br>\n",
    "\n",
    "`[News ID] [Category] [Subcategory] [News Title] [News Abstrct] [News Url] [Entities in News Title] [Entities in News Abstract] ...`\n",
    "\n",
    "<br>\n",
    "\n",
    "We generate a word_dict file to tranform words in news title to word indexes, and a embedding matrix is initted from pretrained glove embeddings.\n",
    "\n",
    "### behaviors data\n",
    "One simple example: <br>\n",
    "`1\tU82271\t11/11/2019 3:28:58 PM\tN3130 N11621 N12917 N4574 N12140 N9748\tN13390-0 N7180-0 N20785-0 N6937-0 N15776-0 N25810-0 N20820-0 N6885-0 N27294-0 N18835-0 N16945-0 N7410-0 N23967-0 N22679-0 N20532-0 N26651-0 N22078-0 N4098-0 N16473-0 N13841-0 N15660-0 N25787-0 N2315-0 N1615-0 N9087-0 N23880-0 N3600-0 N24479-0 N22882-0 N26308-0 N13594-0 N2220-0 N28356-0 N17083-0 N21415-0 N18671-0 N9440-0 N17759-0 N10861-0 N21830-0 N8064-0 N5675-0 N15037-0 N26154-0 N15368-1 N481-0 N3256-0 N20663-0 N23940-0 N7654-0 N10729-0 N7090-0 N23596-0 N15901-0 N16348-0 N13645-0 N8124-0 N20094-0 N27774-0 N23011-0 N14832-0 N15971-0 N27729-0 N2167-0 N11186-0 N18390-0 N21328-0 N10992-0 N20122-0 N1958-0 N2004-0 N26156-0 N17632-0 N26146-0 N17322-0 N18403-0 N17397-0 N18215-0 N14475-0 N9781-0 N17958-0 N3370-0 N1127-0 N15525-0 N12657-0 N10537-0 N18224-0`\n",
    "<br>\n",
    "\n",
    "In general, each line in data file represents one instance of an impression. The format is like: <br>\n",
    "\n",
    "`[Impression ID] [User ID] [Impression Time] [User Click History] [Impression News]`\n",
    "\n",
    "<br>\n",
    "\n",
    "User Click History is the user historical clicked news before Impression Time. Impression News is the displayed news in an impression, which format is:<br>\n",
    "\n",
    "`[News ID 1]-[label1] ... [News ID n]-[labeln]`\n",
    "\n",
    "<br>\n",
    "Label represents whether the news is clicked by the user. All information of news in User Click History and Impression News can be found in news data file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Global settings and imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "System version: 3.6.11 | packaged by conda-forge | (default, Aug  5 2020, 20:09:42) \n",
      "[GCC 7.5.0]\n",
      "Tensorflow version: 1.15.2\n"
     ]
    }
   ],
   "source": [
    "import sys\n",
    "sys.path.append(\"../../\")\n",
    "import os\n",
    "import numpy as np\n",
    "import zipfile\n",
    "from tqdm import tqdm\n",
    "import scrapbook as sb\n",
    "from tempfile import TemporaryDirectory\n",
    "import tensorflow as tf\n",
    "tf.get_logger().setLevel('ERROR') # only show error messages\n",
    "\n",
    "from reco_utils.recommender.deeprec.deeprec_utils import download_deeprec_resources \n",
    "from reco_utils.recommender.newsrec.newsrec_utils import prepare_hparams\n",
    "from reco_utils.recommender.newsrec.models.lstur import LSTURModel\n",
    "from reco_utils.recommender.newsrec.io.mind_iterator import MINDIterator\n",
    "from reco_utils.recommender.newsrec.newsrec_utils import get_mind_data_set\n",
    "\n",
    "print(\"System version: {}\".format(sys.version))\n",
    "print(\"Tensorflow version: {}\".format(tf.__version__))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prepare Parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "epochs = 5\n",
    "seed = 40\n",
    "batch_size = 32\n",
    "\n",
    "# Options: demo, small, large\n",
    "MIND_type = 'demo'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download and load data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 17.0k/17.0k [00:01<00:00, 9.67kKB/s]\n",
      "100%|██████████| 9.84k/9.84k [00:01<00:00, 8.34kKB/s]\n",
      "100%|██████████| 95.0k/95.0k [00:08<00:00, 11.4kKB/s]\n"
     ]
    }
   ],
   "source": [
    "tmpdir = TemporaryDirectory()\n",
    "data_path = tmpdir.name\n",
    "\n",
    "train_news_file = os.path.join(data_path, 'train', r'news.tsv')\n",
    "train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')\n",
    "valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')\n",
    "valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')\n",
    "wordEmb_file = os.path.join(data_path, \"utils\", \"embedding.npy\")\n",
    "userDict_file = os.path.join(data_path, \"utils\", \"uid2index.pkl\")\n",
    "wordDict_file = os.path.join(data_path, \"utils\", \"word_dict.pkl\")\n",
    "yaml_file = os.path.join(data_path, \"utils\", r'lstur.yaml')\n",
    "\n",
    "mind_url, mind_train_dataset, mind_dev_dataset, mind_utils = get_mind_data_set(MIND_type)\n",
    "\n",
    "if not os.path.exists(train_news_file):\n",
    "    download_deeprec_resources(mind_url, os.path.join(data_path, 'train'), mind_train_dataset)\n",
    "    \n",
    "if not os.path.exists(valid_news_file):\n",
    "    download_deeprec_resources(mind_url, \\\n",
    "                               os.path.join(data_path, 'valid'), mind_dev_dataset)\n",
    "if not os.path.exists(yaml_file):\n",
    "    download_deeprec_resources(r'https://recodatasets.z20.web.core.windows.net/newsrec/', \\\n",
    "                               os.path.join(data_path, 'utils'), mind_utils)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create hyper-parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "data_format=news,iterator_type=None,support_quick_scoring=True,wordEmb_file=/tmp/tmpcpgw9veg/utils/embedding.npy,wordDict_file=/tmp/tmpcpgw9veg/utils/word_dict.pkl,userDict_file=/tmp/tmpcpgw9veg/utils/uid2index.pkl,vertDict_file=None,subvertDict_file=None,title_size=30,body_size=None,word_emb_dim=300,word_size=None,user_num=None,vert_num=None,subvert_num=None,his_size=50,npratio=4,dropout=0.2,attention_hidden_dim=200,head_num=4,head_dim=100,cnn_activation=relu,dense_activation=None,filter_num=400,window_size=3,vert_emb_dim=100,subvert_emb_dim=100,gru_unit=400,type=ini,user_emb_dim=50,learning_rate=0.0001,loss=cross_entropy_loss,optimizer=adam,epochs=5,batch_size=32,show_step=100000,metrics=['group_auc', 'mean_mrr', 'ndcg@5;10']\n"
     ]
    }
   ],
   "source": [
    "hparams = prepare_hparams(yaml_file, \n",
    "                          wordEmb_file=wordEmb_file,\n",
    "                          wordDict_file=wordDict_file, \n",
    "                          userDict_file=userDict_file,\n",
    "                          batch_size=batch_size,\n",
    "                          epochs=epochs)\n",
    "print(hparams)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "iterator = MINDIterator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train the LSTUR model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tensor(\"conv1d/Relu:0\", shape=(?, 30, 400), dtype=float32)\n",
      "Tensor(\"att_layer2/Sum_1:0\", shape=(?, 400), dtype=float32)\n"
     ]
    }
   ],
   "source": [
    "model = LSTURModel(hparams, iterator, seed=seed)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "586it [00:03, 155.76it/s]\n",
      "236it [00:09, 26.08it/s]\n",
      "7538it [00:00, 7590.51it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'group_auc': 0.5201, 'mean_mrr': 0.2214, 'ndcg@5': 0.2292, 'ndcg@10': 0.2912}\n"
     ]
    }
   ],
   "source": [
    "print(model.run_eval(valid_news_file, valid_behaviors_file))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "1086it [02:23,  7.55it/s]\n",
      "586it [00:01, 430.29it/s]\n",
      "236it [00:08, 28.16it/s]\n",
      "7538it [00:01, 6738.86it/s]\n",
      "1it [00:00,  7.26it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "at epoch 1\n",
      "train info: logloss loss:1.4868141242592814\n",
      "eval info: group_auc:0.5973, mean_mrr:0.2622, ndcg@10:0.3501, ndcg@5:0.2861\n",
      "at epoch 1 , train time: 143.8 eval time: 18.5\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "1086it [02:18,  7.85it/s]\n",
      "586it [00:01, 455.05it/s]\n",
      "236it [00:08, 28.32it/s]\n",
      "7538it [00:01, 6669.92it/s]\n",
      "1it [00:00,  8.64it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "at epoch 2\n",
      "train info: logloss loss:1.3999453176011916\n",
      "eval info: group_auc:0.6219, mean_mrr:0.2803, ndcg@10:0.3726, ndcg@5:0.3099\n",
      "at epoch 2 , train time: 138.3 eval time: 19.2\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "1086it [02:18,  7.83it/s]\n",
      "586it [00:01, 448.54it/s]\n",
      "236it [00:08, 28.40it/s]\n",
      "7538it [00:00, 8089.03it/s]\n",
      "1it [00:00,  8.04it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "at epoch 3\n",
      "train info: logloss loss:1.3563778104044455\n",
      "eval info: group_auc:0.6281, mean_mrr:0.285, ndcg@10:0.3785, ndcg@5:0.3159\n",
      "at epoch 3 , train time: 138.7 eval time: 18.2\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "1086it [02:18,  7.84it/s]\n",
      "586it [00:01, 431.78it/s]\n",
      "236it [00:08, 28.00it/s]\n",
      "7538it [00:01, 7187.47it/s]\n",
      "1it [00:00,  8.33it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "at epoch 4\n",
      "train info: logloss loss:1.3173029956786892\n",
      "eval info: group_auc:0.6369, mean_mrr:0.2913, ndcg@10:0.3851, ndcg@5:0.3225\n",
      "at epoch 4 , train time: 138.5 eval time: 18.5\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "1086it [02:18,  7.84it/s]\n",
      "586it [00:01, 416.18it/s]\n",
      "236it [00:08, 28.36it/s]\n",
      "7538it [00:01, 7087.70it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "at epoch 5\n",
      "train info: logloss loss:1.2810899292017655\n",
      "eval info: group_auc:0.6462, mean_mrr:0.3031, ndcg@10:0.3983, ndcg@5:0.3349\n",
      "at epoch 5 , train time: 138.5 eval time: 18.4\n",
      "CPU times: user 25min 40s, sys: 2min 21s, total: 28min 2s\n",
      "Wall time: 13min 10s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<reco_utils.recommender.newsrec.models.lstur.LSTURModel at 0x7f690ddf8b70>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "586it [00:01, 440.26it/s]\n",
      "236it [00:08, 28.51it/s]\n",
      "7538it [00:00, 9166.73it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'group_auc': 0.6462, 'mean_mrr': 0.3031, 'ndcg@5': 0.3349, 'ndcg@10': 0.3983}\n",
      "CPU times: user 37.1 s, sys: 2.69 s, total: 39.8 s\n",
      "Wall time: 18.1 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "res_syn = model.run_eval(valid_news_file, valid_behaviors_file)\n",
    "print(res_syn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sb.glue(\"res_syn\", res_syn)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_path = os.path.join(data_path, \"model\")\n",
    "os.makedirs(model_path, exist_ok=True)\n",
    "\n",
    "model.model.save_weights(os.path.join(model_path, \"lstur_ckpt\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Output Prediction File\n",
    "This code segment is used to generate the prediction.zip file, which is in the same format in [MIND Competition Submission Tutorial](https://competitions.codalab.org/competitions/24122#learn_the_details-submission-guidelines).\n",
    "\n",
    "Please change the `MIND_type` parameter to `large` if you want to submit your prediction to [MIND Competition](https://msnews.github.io/competition.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "586it [00:01, 438.04it/s]\n",
      "236it [00:08, 28.26it/s]\n",
      "7538it [00:00, 8876.72it/s]\n"
     ]
    }
   ],
   "source": [
    "group_impr_indexes, group_labels, group_preds = model.run_fast_eval(valid_news_file, valid_behaviors_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "7538it [00:00, 44730.54it/s]\n"
     ]
    }
   ],
   "source": [
    "with open(os.path.join(data_path, 'prediction.txt'), 'w') as f:\n",
    "    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):\n",
    "        impr_index += 1\n",
    "        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()\n",
    "        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'\n",
    "        f.write(' '.join([str(impr_index), pred_rank])+ '\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "f = zipfile.ZipFile(os.path.join(data_path, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED)\n",
    "f.write(os.path.join(data_path, 'prediction.txt'), arcname='prediction.txt')\n",
    "f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reference\n",
    "\\[1\\] Mingxiao An, Fangzhao Wu, Chuhan Wu, Kun Zhang, Zheng Liu and Xing Xie: Neural News Recommendation with Long- and Short-term User Representations, ACL 2019<br>\n",
    "\\[2\\] Wu, Fangzhao, et al. \"MIND: A Large-scale Dataset for News Recommendation\" Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://msnews.github.io/competition.html <br>\n",
    "\\[3\\] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python (reco_gpu)",
   "language": "python",
   "name": "reco_gpu"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
